Advanced Graphics and Data Visualization in R Live HTML

Lecture 05: Dimension Transformation

0.1.0 An overview of Advanced Graphics and Data Visualization in R

“Advanced Graphics and Data Visualization in R” is brought to you by the Centre for the Analysis of Genome Evolution & Function’s (CAGEF) bioinformatics training initiative. CSB1021 was developed to enhance the skills of students with basic backgrounds in R by focusing on available philosophies, methods, and packages for plotting scientific data. Many of the datasets and examples used in this course will be drawn from real-world datasets and the techniques learned herein aim to be broadly applicable to multiple fields.

This lesson is the fifth in a 6-part series. The aim is that, by the end of the series, students will be able to import, format, and display data based on their intended message and audience. The format and style of these visualizations will help to identify and convey the key message(s) from their experimental data.

The structure of the class is a code-along style in R markdown notebooks. At the start of each lecture, skeleton versions of the lecture will be provided for use on the University of Toronto datatools Hub so students can program along with the instructor.


0.2.0 Lecture objectives

This week will focus on exploring the relationships within your data observations, comparing between experimental samples, and applying a suite of clustering, dimension reduction, and projection algorithms.

At the end of this lecture you will have covered visualizations related to:

  1. Clustering with heatmaps and dendrograms
  2. K-means clustering
  3. Principal component analysis
  4. Non-linear projections

0.3.0 A legend for text format in R markdown

grey background - a package, function, code, command or directory. Backticks are also used for in-line code.
italics - an important term or concept or an individual file or folder
bold - heading or a term that is being defined
blue text - named or unnamed hyperlink

... - Within each coding cell this will indicate an area of code that students will need to complete for the code cell to run correctly.

Blue box: A key concept that is being introduced

Yellow box: Risk or caution

Green boxes: Recommended reads and resources to learn R

Red boxes: A comprehension question which may or may not involve a coding cell. You will usually find these at the end of a section.


0.4.0 Lecture and data files used in this course

0.4.1 Weekly Lecture and skeleton files

Each week, new lesson files will appear within your RStudio folders. We are pulling from a GitHub repository using this Repository git-pull link. Simply click on the link and it will take you to the University of Toronto datatools Hub. You will need to use your UTORid credentials to complete the login process. From there you will find each week’s lecture files in the directory /2025-03-Adv_Graphics_R/Lecture_XX. You will find a partially coded skeleton.Rmd file as well as all of the data files necessary to run the week’s lecture.

Alternatively, you can download the R-Markdown Notebook (.Rmd) and data files from the RStudio server to your personal computer if you would like to run independently of the Toronto tools.

0.4.2 Live-coding HTML page

A live lecture version will be available at camok.github.io that will update as the lecture progresses. Be sure to refresh to take a look if you get lost!

0.4.3 Post-lecture Files

At the end of each lecture there will be a completed version of the lecture code released as an HTML file under the Modules section of Quercus.

0.4.4 Data used in this lesson

Today’s datasets will focus on the smaller datasets we worked on in earlier lectures (namely our Ontario public health unit COVID-19 demographics data), and a new set of RNA-Seq analyses of different tissue samples from COVID-19 patients.

0.4.4.1 Dataset 1: phu_demographics_census_norm.csv

This is a combination of datasets from previous lectures. This incorporates PHU demographic data with StatsCan census data from 2017 to produce a normalized estimate of cases, deaths, and hospitalizations across age groups and public health units in Ontario.

0.4.4.2 Dataset 2: Wyler2020_AEC_SARSCoV2_17AAG_readcounts.tsv

This is the same read count data we looked at in Lecture 04: RNA-Seq read counts generated from SARS-CoV-2 infections of AEC cells, used to compare the timecourse of (pro-inflammatory) expression changes in samples treated with and without HSP90 inhibitors. Published in iScience doi: https://doi.org/10.1016/j.isci.2021.102151

0.4.4.3 Dataset 3: GSE150316_DeseqNormCounts_final.txt

From Desai et al., 2020 on medRxiv doi: https://doi.org/10.1101/2020.07.30.20165241 this dataset has normalized expression counts from RNA-Seq data. It covers multiple samples and tissues from COVID-positive patients, with a focus on lung tissue. The expression data has been left unfiltered, so SARS-CoV-2 expression counts are included as well.

0.4.4.4 Dataset 4: 2020.07.30.20165241-supp_tables.xlsx

From Desai et al., 2020 on medRxiv doi: https://doi.org/10.1101/2020.07.30.20165241 this dataset contains patient information, such as viral load, from the tissues that were used for RNA-Seq.


0.5.0 Packages used in this lesson

tidyverse which has a number of packages including dplyr, tidyr, stringr, forcats and ggplot2

viridis helps to create color-blind palettes for our data visualizations

RColorBrewer has some helpful palettes that we’ll need to colour our data.

gplots will be used to help generate heatmap/dendrogram visualizations.

FactoMineR and factoextra will be used for PCA generation and visualization

Rtsne and umap are packages implementing non-linear projection algorithms

# Some packages can be installed via Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("ComplexHeatmap")
install.packages("FactoMineR", dependencies = TRUE)
install.packages("factoextra", dependencies = TRUE)
## Warning: package 'factoextra' is in use and will not be installed
install.packages("Rtsne")
## Warning: package 'Rtsne' is in use and will not be installed
install.packages("umap")
## Warning: package 'umap' is in use and will not be installed
# ------------ Legacy installation code which we hopefully never need again ----------#
# We need to specifically install a package called pbkrtest for the facto packages
# This is due to using an older version of R 
# library(remotes)
# install_version("pbkrtest", "0.5.1")
# Packages to help tidy our data
library(tidyverse)
library(readxl)
library(magrittr)

# Packages for the graphical analysis section
library(viridis)
# library(gplots) # heatmap2()
library(ComplexHeatmap) # Heatmap()
library(RColorBrewer)

# Useful for PCA and PCA visualization
library(FactoMineR)
library(factoextra)

# Data projection packages
library(Rtsne)
library(umap)

1.0.0 Data categorization, dimension reduction, and dimension transformation

Last week we looked at an analysis of RNAseq data through a number of methods starting with broad-level volcano plots and moving towards gene-level expression visualizations with dotplots. In between we stopped to take a look at heatmaps. In this instance we simply used heatmaps to convey expression levels of multiple genes across one or more samples.

Looking more broadly, we now wish to ask questions such as “how similar are our samples?”, “can our samples be grouped or categorized in some way?” and “is there an underlying structure or architecture to our data?” We briefly discussed scatterplot matrices to help analyse our sample quality among replicate experiments.

We can study these questions using a number of techniques that range from clustering/categorization to projection of high-dimensional data onto a lower set of dimensions (usually 2 or 3).


1.1.0 What kind of data do we care about?

We cannot begin our journey until we talk about the nature of the data we are interested in examining. Usually, our data will consist of many samples (observations) with some number of features (variables) captured about each observation. For example, with RNAseq data we could consider the measurements we capture for each gene as a separate feature/variable/column.

Conversely, you may have hundreds or thousands of samples you’d like to categorize in some way (or show that you can categorize) with a smaller set of features. For every nut, there is a tool of sorts for cracking it!

We’ll start with a COVID-19 dataset collected by Ontario. It consists of demographic case data from 34 public health units (PHUs) spanning January 2020 to March 2022, with observations broken down by PHU and age group. The demographics data has been modified to include a set of normalized values (individuals per 100K) calculated from StatsCan 2017 census data: the 2022 size of each age group was estimated by modelling from the census, and case, death, and hospitalization counts were then normalized by these subgroup sizes to give values per 100K individuals. We’ll be working with this dataset mainly because it uses fine-grained binning of age groups and is an appropriate size for a first look at clustering data.
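The normalization step itself is simple arithmetic: divide a count by the estimated subgroup size and scale to 100,000. As a quick sanity check using the Algoma "0 to 4" numbers that appear later in this lecture (181 cases, an estimated 5,481 individuals):

```r
# Per-100K normalization: count / subgroup population * 100,000
total_cases      <- 181    # cumulative cases, Algoma, ages 0 to 4
individuals_2022 <- 5481   # estimated 2022 population of that subgroup

cases_per_100k <- total_cases / individuals_2022 * 1e5
round(cases_per_100k)
## [1] 3302
```

This matches the cases_per_100k value we'll see for that row when we load the data below.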

The dataset will be found in phu_demographics_census_norm.csv and we’ll use it to guide us through two sections of today’s lecture.

Let’s begin by loading the data and taking a look at its structure.

# Import the normalized demographics data
covid_demographics_norm.df <- read_csv("data/phu_demographics_census_norm.csv")
## Rows: 238 Columns: 15
## -- Column specification -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
## Delimiter: ","
## chr  (4): from_date, to_date, public_health_unit, age_group
## dbl (11): total_cases, total_deaths, total_hospitalizations_count, percent_c...
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Look at its structure
str(covid_demographics_norm.df, give.attr = FALSE)
## spc_tbl_ [238 x 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ from_date                   : chr [1:238] "January 15, 2020" "January 15, 2020" "January 15, 2020" "January 15, 2020" ...
##  $ to_date                     : chr [1:238] "March 2, 2022" "March 2, 2022" "March 2, 2022" "March 2, 2022" ...
##  $ public_health_unit          : chr [1:238] "Algoma" "Algoma" "Algoma" "Algoma" ...
##  $ age_group                   : chr [1:238] "0 to 4" "12 to 19" "20 to 39" "40 to 59" ...
##  $ total_cases                 : num [1:238] 181 412 1868 1424 379 ...
##  $ total_deaths                : num [1:238] 0 0 1 5 0 14 13 0 0 2 ...
##  $ total_hospitalizations_count: num [1:238] 9 5 31 48 0 69 44 10 2 35 ...
##  $ percent_cases               : num [1:238] 0.0343 0.0781 0.3543 0.2701 0.0719 ...
##  $ percent_deaths              : num [1:238] 0 0 0.0303 0.1515 0 ...
##  $ percent_hospitalizations    : num [1:238] 0.0437 0.0243 0.1505 0.233 0 ...
##  $ population_2022             : num [1:238] 117840 117840 117840 117840 117840 ...
##  $ individuals_2022            : num [1:238] 5481 9230 24676 33111 7751 ...
##  $ cases_per_100k              : num [1:238] 3302 4464 7570 4301 4890 ...
##  $ deaths_per_100k             : num [1:238] 0 0 4 15 0 46 177 0 0 5 ...
##  $ hospitalizations_per_100k   : num [1:238] 164 54 126 145 0 228 598 116 13 95 ...

1.2.0 Looking for trends/groups in your data

Let’s begin looking at our covid_demographics_norm.df. Using a series of data sets, we’ve created a consolidated dataset with:

  1. Cumulative cases, deaths, and hospitalizations due to COVID-19 within each age group per PHU.

  2. Representation of each age group within each data category as a percent of total incidents in each PHU.

  3. Using 2017 census data, the number of cases per 100,000 individuals normalized by estimated population size for each age group within each PHU.

The question we want to answer is: of the 34 public health units, which look most similar based on the normalized case data for each age group? In order to visualize this data, we’ll want to convert our current long-form data that looks something like this:

public_health_unit age_group total_cases cases_per_100k deaths_per_100k hospitalizations_per_100k
Algoma 0 to 4 181 3302 0 164
Algoma 12 to 19 412 4464 0 54

Into something like this:

public_health_unit population_2022 category 0 to 4 5 to 11 12 to 19 20 to 39 40 to 59 60 to 79 80+
Algoma 117840 cases 3302 4890 4464 7570 4301 2331 4116

We need to do the following to the dataset

  1. Select just the variables we are interested in.
  2. Pivot the age-group data out into its own columns so we have one observation (row) per PHU and category.
  3. Correct the position of the age groups (the 5 to 11 column needs to be moved after 0 to 4).
# Create a wide-format version of our normalized data
covid_demographics_norm_wide.df <-

  # Start by passing along the long-form normalized data
  covid_demographics_norm.df %>% 

  # Select for just the PHU, age_group, population size and cases/hosp/deaths_per_100k
  dplyr::select(c(3,4,11,13:15)) %>% 

  # Pivot the data to a longer format
  pivot_longer(cols = -c(1:3), names_to = "category", values_to = "per_100k") %>% 

  # Get rid of the suffix of "_per_100k"
  mutate(category = str_remove_all(string = category, pattern = "_per_100k")) %>% 

  # Pivot the age_group/per_100k data out wider
  pivot_wider(names_from = age_group,
              values_from = per_100k
             ) %>% 

  # Move the "5 to 11" category to after "0 to 4". You could use a longer "select" call to do this too!
  relocate(., `5 to 11`, .after=`0 to 4`)

# Take a look at our resulting dataframe
head(covid_demographics_norm_wide.df)
## # A tibble: 6 x 10
##   public_health_unit population_2022 category      `0 to 4` `5 to 11` `12 to 19`
##   <chr>                        <dbl> <chr>            <dbl>     <dbl>      <dbl>
## 1 Algoma                      117840 cases             3302      4890       4464
## 2 Algoma                      117840 deaths               0         0          0
## 3 Algoma                      117840 hospitalizat~      164         0         54
## 4 Brant County                153558 cases             4481      5376       6499
## 5 Brant County                153558 deaths               0         0          0
## 6 Brant County                153558 hospitalizat~      116        15         13
## # i 4 more variables: `20 to 39` <dbl>, `40 to 59` <dbl>, `60 to 79` <dbl>,
## #   `80+` <dbl>

1.2.1 Cast our data to a matrix for heatmaps

At this point, we would normally be ready to visualize our data with ggplot2, but our data is in wide format and we’ll be using a package that prefers to work with matrix data. That means we need to strip the data down further, because a matrix must contain values of a single type and we want to work with numeric data! We’ll use as.matrix() for our conversion.

  1. We’ll filter our dataset to only include the “cases” data we saw above, and then drop the category variable from it.
  2. We’ll move the public_health_unit information over to the row names of our matrix so we can still track our data properly.
  3. We’ll still keep our population_2022 column of data but we’ll have to remember that it’s there.
# Now we need to make our matrix and assign row names. 
# It's kind of awkward to need this requirement.

# Cast our matrix and drop the first column
covid_demographics_norm.mx <- covid_demographics_norm_wide.df %>% 
  
  # Filter for just case data
  dplyr::filter(category == "cases") %>% 
  
  # Drop the PHU and category data
  dplyr::select(-c(public_health_unit, category)) %>%
  
  # Convert to a matrix
  as.matrix()

# Set the row names using the information from the data frame
rownames(covid_demographics_norm.mx) <- 
  covid_demographics_norm_wide.df %>% 
  
  # Pull the PHU names
  pull(public_health_unit) %>% 
  
  # Create a unique vector of them
  unique()

# Take a peek at our data
head(covid_demographics_norm.mx)
##                                           population_2022 0 to 4 5 to 11
## Algoma                                             117840   3302    4890
## Brant County                                       153558   4481    5376
## Durham Region                                      711426   4669    5697
## Grey Bruce                                         176150   2324    3040
## Haldimand-Norfolk                                  120008   3002    4956
## Haliburton, Kawartha, Pine Ridge District          190728   2263    2865
##                                           12 to 19 20 to 39 40 to 59 60 to 79
## Algoma                                        4464     7570     4301     2331
## Brant County                                  6499     9448     6429     3995
## Durham Region                                 6407    11018     7333     4985
## Grey Bruce                                    4643     5995     3489     2022
## Haldimand-Norfolk                             5388     9803     5799     3568
## Haliburton, Kawartha, Pine Ridge District     3673     7368     3681     1923
##                                            80+
## Algoma                                    4116
## Brant County                              5684
## Durham Region                             7562
## Grey Bruce                                3056
## Haldimand-Norfolk                         6261
## Haliburton, Kawartha, Pine Ridge District 3612

1.3.0 Plot and reorder our PHUs as a heatmap with Heatmap()

From the ComplexHeatmap package we can use the function Heatmap() to generate a heatmap that can do a little more than our ggplot2 version. A nice aspect of using Heatmap() is that we can ask it to reorder our samples along both the rows and columns. At the same time, we can present the relationships between column elements or row elements in the form of dendrograms.

1.3.1 How do we reorder our data via clustering?

In its current form, our data is ordered along the columns in an ascending ordinal arrangement. While this makes sense to us now, it may not help to visually identify the strongest trends or groupings in our data. Clustering attempts to bring order to our data by grouping it according to specific algorithms.

Overall, single points are usually grouped together as neighbours, with the “nearest” neighbours determined by a metric of some kind (e.g. Euclidean distance). These clusters are then further grouped using the same metric until the entire set is presented in these ordered groups. These groupings/relationships can also be presented as dendrograms. In our current case, we aren’t concerned with determining groups per se, but rather with connecting them by their similarity and then creating a hierarchical order.

Our data is usually coded in n-dimensional space, depending on the nature of our dataset. For covid_demographics_norm.mx, our 7 columns are coded by data from 34 PHUs, meaning each column is coded in 34 dimensions (34 features). Conversely, our 34 rows represent PHUs, each coded by data across 7 age groups and therefore in 7 dimensions.
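Base R implements this hierarchical procedure directly with dist() (pairwise distances between rows) and hclust() (iterative merging of the nearest groups). As a minimal sketch on a toy matrix (not our PHU data), the leaf order of the resulting tree is exactly the row order a clustered heatmap would use:

```r
# Toy matrix: 4 observations (rows) measured across 3 features (columns)
toy.mx <- rbind(a = c(1, 2, 1),
                b = c(1, 2, 2),   # very close to "a"
                c = c(8, 9, 8),
                d = c(9, 9, 8))   # very close to "c"

# Pairwise Euclidean distances between rows
toy.dist <- dist(toy.mx)

# Agglomerative hierarchical clustering: merge nearest groups until one tree remains
toy.hc <- hclust(toy.dist)

# The dendrogram leaf order groups "a" with "b" and "c" with "d"
rownames(toy.mx)[toy.hc$order]

# plot(toy.hc) would draw the dendrogram itself
```

This is the same machinery Heatmap() runs under the hood when clustering is enabled.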

Important or helpful parameters to run Heatmap():

  • matrix: our matrix object. It must be a matrix, not a data frame. It could also be a vector, but that becomes a single column.
  • col: determines the colours used for the image - nothing to do with column names.
  • name: the name/title of the heatmap (also used as the legend title by default)
  • row_*: set our row text properties, including titles and labels:
    • row_title, row_title_side, row_title_gp (graphic properties)
    • row_names_side, row_names_gp
    • column properties can also be changed using column_*
  • cluster_rows and cluster_columns: logical parameters that determine whether clustering should occur (default is TRUE)
  • show_heatmap_legend: whether or not to show the heatmap legend
  • heatmap_legend_param: a list of legend settings, e.g. list(title = "x") to set the legend title
  • show_[row/column]_dend: whether or not to show dendrograms for the rows/columns

There are actually a lot of options but these should help us make a basic one. Let’s plot our data with and without a dendrogram to compare.

?Heatmap
# Create a Heatmap object

cases_hmap <- 
  # Supply our matrix minus the population size column
  Heatmap(covid_demographics_norm.mx[,-1],                                   
                      
          cluster_rows = FALSE, cluster_columns = FALSE, # Don't cluster on either rows or columns

          # Use column_title as the title of our heatmap
          column_title = "Heatmap of COVID-19 cases in PHUs by age group: unclustered",
                      
          # Rotate the legend horizontally and give it a title
          heatmap_legend_param = list(title = "cases per 100K individuals",
                                      legend_direction = "horizontal"),
                      
          # Rotate column names to horizontal
          column_names_rot = 0,
          column_names_center = TRUE
         )

# Plot the heatmap 
ComplexHeatmap::draw(cases_hmap, 
     # Plot the legend on the bottom
     heatmap_legend_side = "bottom"
    )

# Create a Heatmap object
cases_hmap <- 
  # Supply our matrix minus the population size column
  Heatmap(covid_demographics_norm.mx[,-1],         
                      
          cluster_rows = TRUE, cluster_columns = TRUE,   # Cluster on both rows and columns
                      
          # Use column_title as the title of our heatmap
          column_title = "Heatmap of COVID-19 cases in PHUs by age group: clustered",
                      
          # Rotate the legend horizontally and give it a title
          heatmap_legend_param = list(title = "cases per 100K individuals",
                                      legend_direction = "horizontal"),
                      
          # Rotate column names to horizontal
          column_names_rot = 0,
          column_names_center = TRUE
         )

# Plot the heatmap 
ComplexHeatmap::draw(cases_hmap, 
     # Plot the legend on the bottom
     heatmap_legend_side = "bottom"
    )


1.3.2 Concatenate multiple heatmaps together into a HeatmapList object

Our heatmap above is drawn from a single dataset category - cases - but it helps us to see that there are some strong signals that differentiate our PHUs. This is a 34x7 grid in which we can investigate all of the case data in our dataset. What happens when we want to produce this kind of visualization for our other two metrics, hospitalizations and deaths?

To accomplish this, we can repeat our steps and create 3 separate Heatmap objects but we can plot them together as a single complex heatmap. In fact, rather than store multiple heatmap objects, we’ll create a HeatmapList object by concatenating together multiple Heatmap objects using a for loop.

We’ll recreate our code from above but simplify the heatmap code a little. While we’re at it, we’ll also convert our colourscheme to viridis.

# Create a list of heatmap objects from our demographic data
hm_list <- NULL

# Create some quick vectors to help ourselves out
categories <- c("cases", "hospitalizations", "deaths")

# Store the rownames we want to use on our matrices
mx_rownames <- covid_demographics_norm_wide.df %>% pull(public_health_unit) %>% unique()

# Use a loop to separate the data by category
for (i in categories) {
  
  # Create a temporary matrix of the specific category data
  temp_mx <- covid_demographics_norm_wide.df %>% 
      dplyr::filter(category == i) %>% 
      dplyr::select(-c(public_health_unit, category)) %>% 
      as.matrix()
  
  # Set the rownames
  rownames(temp_mx) <- mx_rownames
  
  # Create a Heatmap object
  hmap <- Heatmap(temp_mx[,-1],   # Supply our matrix minus the population size column
                  
                  # Use a viridis colourscale broken into 100 segments
                  col = viridis(100),

                  # Use column_title as the title of our heatmap
                  column_title = i,
                  
                  # Rotate the legend horizontally and give it a title based on category
                  heatmap_legend_param = list(title = paste0(i, " per 100K individuals"),
                                              legend_direction = "horizontal")
                 )
  
  # Add this to our heatmap list
  hm_list <- hm_list + hmap
}

# What kind of object is hm_list?
class(hm_list)
## [1] "HeatmapList"
## attr(,"package")
## [1] "ComplexHeatmap"

1.3.3 draw() your HeatmapList

Now that we have our HeatmapList we can simply draw the whole thing and it will automatically concatenate for us. Of course there are many options we can use to display the list. One thing to note is that the clustering of columns in each heatmap is independent while the clustering of rows (and thus order) is based on only the first heatmap (by default).

Using the ComplexHeatmap::draw() method to display your HeatmapList will allow you to further customize details of the overall figure including setting row_title and column_title for the entire figure.

ComplexHeatmap::draw(hm_list, 
     column_title = "COVID-19 metrics breakdown by age group and PHU\n",
     column_title_gp = gpar(fontsize = 20)    # gpar() sets graphical parameters settings
    )
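By default the first heatmap added to the list acts as the "main" heatmap. If you would rather let a different category drive the row order, the draw() method for a HeatmapList accepts a main_heatmap argument; as a sketch building on the hm_list created above (the index 2 here is just an illustration and would correspond to the hospitalizations heatmap):

```r
# Let the second heatmap in the list control the row clustering and row order
ComplexHeatmap::draw(hm_list, 
     main_heatmap = 2,
     column_title = "COVID-19 metrics breakdown by age group and PHU\n",
     column_title_gp = gpar(fontsize = 20)
    )
```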


1.4.0 Building correlation heatmaps with complex RNA-Seq data

Our heatmap above is drawn from a relatively simple dataset, but it helps us to see that there are some strong signals that differentiate our PHUs. It becomes clear that while cases in each PHU are dominated by the 20-39 age segment, the bulk of hospitalizations and deaths originate from the 80+ segment.

Of course, this data is cumulative information from January 2020 to March 2022. Given the effects of vaccination and different variants, a look at the current demographics may reveal a shift in these trends.

Because the heatmap is only a 34x7 grid, it is easy to visualize and dissect all of the data in our dataset. What happens, however, when we want to produce this kind of visualization from something much larger or more complex?
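One common way to tame a large expression matrix is to compare samples rather than genes: cor() computes a sample-by-sample correlation matrix (it correlates columns against each other), and that small square matrix can then be passed to Heatmap(). A minimal base-R sketch on made-up numbers (not the Wyler data):

```r
# Toy "expression" matrix: 5 genes (rows) x 3 samples (columns)
expr.mx <- cbind(s1 = c(10, 50, 3, 200, 7),
                 s2 = c(12, 48, 4, 190, 8),    # tracks s1 closely
                 s3 = c(100, 5, 80, 2, 60))    # a very different profile

# Sample-by-sample Spearman correlations -> a symmetric 3x3 matrix
cor.mx <- cor(expr.mx, method = "spearman")

round(cor.mx, 2)
# s1 and s2 have identical ranks, so cor.mx["s1", "s2"] is exactly 1
# A matrix like this can be handed straight to Heatmap(cor.mx)
```

The same idea scales to the 27,011-gene matrix below: however many genes we measure, the correlation matrix stays samples-by-samples.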

Recall from last lecture that we investigated read count data from Wyler et al., 2020 - Wyler2020_AEC_SARSCoV2_17AAG_readcounts.tsv. We used this dataset to produce a scatterplot matrix to compare between some of the replicate data within the set. Across this set there are 36 sets of RNA-Seq data spanning 27,011 genes.

Let’s begin by opening up the data file and getting a quick reminder of what it looks like.

# Read in your read_count data
wyler_readcounts.df <- read_tsv("./data/Wyler2020_AEC_SARSCoV2_17AAG_readcounts.tsv")
## Rows: 27011 Columns: 38
## -- Column specification -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
## Delimiter: "\t"
## chr  (1): gene
## dbl (37): width, AECII_01_24h_DMSO-1, AECII_02_24h_DMSO-2, AECII_03_24h_DMSO...
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(wyler_readcounts.df, give.attr = FALSE)
## spc_tbl_ [27,011 x 38] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ gene                        : chr [1:27011] "A1BG" "A1BG-AS1" "A1CF" "A2M" ...
##  $ width                       : num [1:27011] 1766 2130 9529 4653 2187 ...
##  $ AECII_01_24h_DMSO-1         : num [1:27011] 22 55 0 0 0 ...
##  $ AECII_02_24h_DMSO-2         : num [1:27011] 19 20 0 1 0 ...
##  $ AECII_03_24h_DMSO-3         : num [1:27011] 15 38 0 0 2 ...
##  $ AECII_04_24h_200nM17AAG-1   : num [1:27011] 29 11 0 0 2 4740 0 0 882 0 ...
##  $ AECII_05_24h_200nM17AAG-2   : num [1:27011] 27 15 0 0 0 ...
##  $ AECII_06_24h_200nM17AAG-3   : num [1:27011] 30 17 0 0 2 4220 0 0 820 0 ...
##  $ AECII_07_24h_S2_DMSO-1      : num [1:27011] 24 41 0 0 0 ...
##  $ AECII_08_24h_S2_DMSO-2      : num [1:27011] 18 46 0 0 2 ...
##  $ AECII_09_24h_S2_DMSO-3      : num [1:27011] 22 42 0 0 4 2750 0 0 739 0 ...
##  $ AECII_10_24h_S2_200nM17AAG-1: num [1:27011] 23 13 0 1 3 ...
##  $ AECII_11_24h_S2_200nM17AAG-2: num [1:27011] 25 24 0 0 1 ...
##  $ AECII_12_24h_S2_200nM17AAG-3: num [1:27011] 18 28 1 0 1 ...
##  $ AECII_13_48h_DMSO-1         : num [1:27011] 24 57 0 0 0 ...
##  $ AECII_14_48h_DMSO-2         : num [1:27011] 38 56 0 0 0 1930 0 0 892 0 ...
##  $ AECII_15_48h_DMSO-3         : num [1:27011] 33 64 0 0 0 ...
##  $ AECII_16_48h_200nM17AAG-1   : num [1:27011] 29 35 0 0 1 ...
##  $ AECII_17_48h_200nM17AAG-2   : num [1:27011] 20 24 0 0 0 ...
##  $ AECII_18_48h_200nM17AAG-3   : num [1:27011] 23 28 0 0 0 ...
##  $ AECII_19_48h_S2_DMSO-1      : num [1:27011] 13 45 0 0 0 ...
##  $ AECII_20_48h_S2_DMSO-2      : num [1:27011] 22 36 0 0 0 ...
##  $ AECII_21_48h_S2_DMSO-3      : num [1:27011] 21 30 0 0 1 ...
##  $ AECII_22_48h_S2_200nM17AAG-1: num [1:27011] 31 29 0 0 1 ...
##  $ AECII_23_48h_S2_200nM17AAG-2: num [1:27011] 36 23 0 0 1 ...
##  $ AECII_24_48h_S2_200nM17AAG-3: num [1:27011] 20 19 1 0 0 ...
##  $ AECII_25_72h_DMSO-1         : num [1:27011] 35 43 0 1 2 ...
##  $ AECII_26_72h_DMSO-2         : num [1:27011] 26 38 0 0 1 997 0 0 753 0 ...
##  $ AECII_27_72h_DMSO-3         : num [1:27011] 29 53 0 0 4 1230 0 0 953 0 ...
##  $ AECII_28_72h_200nM17AAG-1   : num [1:27011] 32 57 0 0 2 ...
##  $ AECII_29_72h_200nM17AAG-2   : num [1:27011] 41 49 0 0 2 1050 0 0 553 0 ...
##  $ AECII_30_72h_200nM17AAG-3   : num [1:27011] 33 37 0 0 3 857 0 0 458 0 ...
##  $ AECII_31_72h_S2_DMSO-1      : num [1:27011] 24 58 0 0 1 ...
##  $ AECII_32_72h_S2_DMSO-2      : num [1:27011] 33 63 0 0 3 ...
##  $ AECII_33_72h_S2_DMSO-3      : num [1:27011] 23 47 1 1 2 ...
##  $ AECII_34_72h_S2_200nM17AAG-1: num [1:27011] 22 39 0 1 1 ...
##  $ AECII_35_72h_S2_200nM17AAG-2: num [1:27011] 28 26 0 0 2 866 0 0 481 0 ...
##  $ AECII_36_72h_S2_200nM17AAG-3: num [1:27011] 43 34 0 0 3 ...

1.4.1 Wrangle our readcount data

As you might recall from last week, there is quite a bit of data in this set. So that we can more easily visualize our data, we’ll want to wrangle it a bit more by:

  1. Updating the column names to remove some of the unnecessary title information.
  2. Filtering by read counts to limit our genes to those with values between 1000 and 3000.
  3. Converting our data frame to a matrix and renaming the rows using the gene names.

Why do we filter our read counts? In a real analysis, wild swings in expression are often exactly what we’re looking for. For in-class analysis, however, to save a bit of memory and time, we limit the read counts to a narrow range to reduce our set size. This will also greatly reduce the size of our heatmap for viewing. By filtering, we’ll go from 27,011 observations to 109.

wyler_readcounts_filtered.df <-
    
    wyler_readcounts.df %>% 

    # Rename the columns by removing the first portion: AECII_xx
    rename_with(., ~ str_replace(string = .x, 
                                 pattern = r"(\w*_\d*_)", 
                                 replacement = "")) %>% 

    # filter out the low-readcount data
    filter(if_all(.cols = -c(1,2), .fns = ~ .x > 1000 & .x < 3000))

# Create our data matrix
wyler_readcounts.mx <- as.matrix(wyler_readcounts_filtered.df[, 3:38])

# Set the row names using the information from the data frame
rownames(wyler_readcounts.mx) <- wyler_readcounts_filtered.df$gene

# Check the characteristics of our matrix
head(wyler_readcounts.mx)
##         24h_DMSO-1 24h_DMSO-2 24h_DMSO-3 24h_200nM17AAG-1 24h_200nM17AAG-2
## AGPAT3        2115       1558       1877             1973             1721
## AHCYL1        1878       1531       1707             2084             1912
## ANKLE2        2099       1401       1787             1296             1156
## ANKRD17       2017       1487       1900             2265             1983
## ANO6          2205       1563       1829             2112             1873
## AP1G1         2204       1649       2002             2175             1996
##         24h_200nM17AAG-3 24h_S2_DMSO-1 24h_S2_DMSO-2 24h_S2_DMSO-3
## AGPAT3              1907          1826          1613          1967
## AHCYL1              2114          1704          1619          1935
## ANKLE2              1202          1778          1761          2083
## ANKRD17             2205          1730          1699          2015
## ANO6                1971          1882          1828          2214
## AP1G1               1998          1780          1769          2120
##         24h_S2_200nM17AAG-1 24h_S2_200nM17AAG-2 24h_S2_200nM17AAG-3 48h_DMSO-1
## AGPAT3                 1921                1661                1822       1905
## AHCYL1                 2109                1845                2034       2370
## ANKLE2                 1377                1336                1257       2056
## ANKRD17                2447                2367                2514       1967
## ANO6                   1978                1847                1907       1985
## AP1G1                  2094                1901                2069       1939
##         48h_DMSO-2 48h_DMSO-3 48h_200nM17AAG-1 48h_200nM17AAG-2
## AGPAT3        2066       1986             1885             1598
## AHCYL1        2739       2548             2257             1815
## ANKLE2        2320       2163             2013             1661
## ANKRD17       2108       2064             2303             2067
## ANO6          2203       2194             2512             2191
## AP1G1         2166       2019             1589             1349
##         48h_200nM17AAG-3 48h_S2_DMSO-1 48h_S2_DMSO-2 48h_S2_DMSO-3
## AGPAT3              1864          1835          2111          1717
## AHCYL1              2367          1716          1796          1565
## ANKLE2              1838          1690          1888          1656
## ANKRD17             2544          1480          1597          1297
## ANO6                2537          1532          1713          1376
## AP1G1               1781          1506          1681          1454
##         48h_S2_200nM17AAG-1 48h_S2_200nM17AAG-2 48h_S2_200nM17AAG-3 72h_DMSO-1
## AGPAT3                 1789                1833                1670       2269
## AHCYL1                 1990                2274                1825       2759
## ANKLE2                 2108                2256                1744       1865
## ANKRD17                2304                2488                2159       2283
## ANO6                   2476                2687                2555       2254
## AP1G1                  1488                1751                1672       2233
##         72h_DMSO-2 72h_DMSO-3 72h_200nM17AAG-1 72h_200nM17AAG-2
## AGPAT3        2147       2467             1853             1764
## AHCYL1        2511       2919             2179             2077

## 7SK            0.00000      0.0000        0.0000     0.000000      0.00000
## A1BG          23.72684      0.0000        0.0000     2.972536      0.00000
## A1BG-AS1       0.00000      0.0000        0.0000     0.000000      0.00000
## A1CF           0.00000      0.0000        0.0000     0.000000      0.00000
##           case5.lung1 case5.lung2 case5.lung3 case5.lung4 case5.fat1
## 5S_rRNA    17.8301501   204.14617   5.2150746   6.5310519    0.00000
## 5_8S_rRNA   9.5673976    47.63411   2.8248321   3.4832277    0.00000
## 7SK         0.8697634     0.00000   0.2172948   0.0000000   66.65604
## A1BG        0.0000000     0.00000   0.6518843   0.4354035    0.00000
## A1BG-AS1    3.4790537     0.00000   1.0864739   0.8708069    0.00000
## A1CF        0.0000000     0.00000   0.0000000   0.0000000    0.00000
##           case5.lung5 case5.skin1 case5.bowel1 case5.liver1 case5.kidney1
## 5S_rRNA     4.7856618   1.8975379    4.2612716   261.001755     95.246937
## 5_8S_rRNA   1.5952206   0.3795076    0.4734746    66.169459     38.098775
## 7SK         0.0000000   1.8975379    0.4734746     0.000000      5.195287
## A1BG        0.1994026   0.3795076    0.0000000     7.352162      0.000000
## A1BG-AS1    0.1994026   0.0000000    0.2367373     3.676081      3.463525
## A1CF        0.0000000   0.0000000    0.0000000   102.930270      0.000000
##           case5.heart1 case5.marrow1 NegControl1 NegControl2 NegControl3
## 5S_rRNA      6.0474605      9.419146   1.0790885  108.314409   4.0667302
## 5_8S_rRNA    3.9308493      7.326002   0.8992404   56.091390   1.3555767
## 7SK          0.3023730      0.000000   0.0000000    0.000000   1.5815062
## A1BG         0.9071191      0.000000   0.0000000    0.000000   0.2259295
## A1BG-AS1     0.6047461      1.046572   1.0790885    2.901279   1.1296473
## A1CF         0.3023730      0.000000   0.0000000    0.000000   0.0000000
##           NegControl4 NegControl5 case6.lung1 case6.lung2 case6.lung3
## 5S_rRNA    14.1234049  26.5186080  80.5830895  129.256507     60.0512
## 5_8S_rRNA   4.4600226   4.1435325  24.7947968   33.142694     10.9184
## 7SK         0.9911161   0.8287065   0.0000000    0.000000     21.8368
## A1BG        0.2477790   0.0000000   0.0000000    3.314269      8.1888
## A1BG-AS1    1.4866742   0.0000000   0.0000000    0.000000      5.4592
## A1CF        0.2477790   0.0000000   0.8855285   36.456964     27.2960
##           case6.lung4 case6.lung5 case7.lung1 case7.lung2 case7.lung3
## 5S_rRNA       0.00000   87.316422   51.187613  16.8835938    5.459087
## 5_8S_rRNA    30.51539   11.713179   13.650030   4.1068201    2.729543
## 7SK          42.72155    2.129669    5.118761   0.4563133   13.647717
## A1BG         48.82462    0.000000    8.531269   0.4563133    2.729543
## A1BG-AS1     18.30923    0.000000    1.706254   0.4563133    0.000000
## A1CF        109.85540    0.000000   34.125075   2.2815667    9.553402
##           case7.lung4 case7.lung5 case8.lung1 case8.lung2 case8.lung3
## 5S_rRNA     96.976830  20.3623448   60.256169  250.998904   23.725512
## 5_8S_rRNA   11.409039   3.0543517   24.102468   59.447109    5.931378
## 7SK          0.000000   0.0000000    0.927018    0.000000    0.659042
## A1BG         5.704519   0.0000000    0.000000    0.000000    0.000000
## A1BG-AS1     0.000000   1.0181172    0.000000    3.302617    0.659042
## A1CF         5.704519   0.5090586    0.000000    0.000000    0.000000
##           case8.lung4 case8.lung5 case8.liver1 case8.bowel1 case8.heart1
## 5S_rRNA    16.2110781  11.4957577    318.54089     42.04929   167.454051
## 5_8S_rRNA   0.8532146   2.8739394     75.84307     23.76699    50.236215
## 7SK         0.0000000   0.4105628      0.00000      0.00000     1.860601
## A1BG        0.0000000   0.0000000     34.12938      0.00000     1.860601
## A1BG-AS1    0.4266073   0.4105628      0.00000      0.00000     0.000000
## A1CF        0.0000000   0.0000000    284.41151      0.00000     0.000000
##           case9.lung1 case9.lung2 case9.lung3 case9.lung4 case9.lung5
## 5S_rRNA    30.4508211  26.1531478  10.9642819   6.5572268  42.6334973
## 5_8S_rRNA  10.7846658   3.5342092   1.7684326   2.0983126   8.1987495
## 7SK         3.1719605   2.1205255   0.7073730   0.0000000   3.2794998
## A1BG        0.0000000   0.7068418   0.0000000   0.2622891   0.0000000
## A1BG-AS1    0.0000000   0.7068418   0.3536865   0.5245781   0.8198749
## A1CF        0.6343921   0.7068418   0.0000000   0.0000000   2.4596248
##           case10.lung1 case10.lung2 case10.lung3 case11.lung1 case11.lung2
## 5S_rRNA     226.248915    69.321897   27.6902247   14.1541194   18.8911136
## 5_8S_rRNA    47.439289    22.218557   15.0318363    6.2907197   13.2937466
## 7SK           0.000000     0.000000    0.0000000    0.5242266    1.3993418
## A1BG          0.000000     0.000000    0.0000000    0.0000000    1.3993418
## A1BG-AS1      0.000000     0.000000    0.7911493    0.5242266    1.3993418
## A1CF          3.649176     1.777485    0.0000000    0.5242266    0.6996709
##           case11.lung3 case11.kidney1 case11.bowel1 caseA.lung.NYC
## 5S_rRNA     64.7298859     145.718388    29.0709229       3.216111
## 5_8S_rRNA   13.9757708      35.080353     7.8100987       0.000000
## 7SK          1.4711338       2.698489     0.4338944       0.000000
## A1BG         0.7355669       0.000000     0.0000000       3.216111
## A1BG-AS1     2.2067007       0.000000     0.4338944       0.000000
## A1CF         0.0000000      10.793955     0.0000000      41.809449
##           caseB.lung.NYC caseC.lung.NYC caseD.lung.NYC caseP1.placenta1
## 5S_rRNA        6.8112310     16.7622456       64.32178        17.470270
## 5_8S_rRNA      2.7864127      9.3123587       25.72871         4.031601
## 7SK            0.6192028      1.2416478        0.00000         0.000000
## A1BG           0.0000000      0.0000000        0.00000         0.000000
## A1BG-AS1       1.2384056      0.6208239        0.00000         0.000000
## A1CF           0.3096014      0.0000000        0.00000         1.343867
##           caseP1.placenta2 caseP2.placenta1 caseP2.placenta2 caseP3.placenta1
## 5S_rRNA          12.483367        8.7605654         9.788657        3.4054256
## 5_8S_rRNA         3.640982        3.9422544         2.719072        1.0478233
## 7SK               8.322244        4.8183110        69.608231      117.8801161
## A1BG              1.040281        0.0000000         0.000000        0.2619558
## A1BG-AS1          0.000000        0.8760565         0.000000        0.2619558
## A1CF              0.000000        0.4380283         0.000000        0.0000000
##           caseP3.placenta2 caseP4.placenta1 caseF.lung.NYC caseE.lung.NYC
## 5S_rRNA          16.042997        5.1152960      2.0354397      13.931242
## 5_8S_rRNA         5.347666        0.2692261      0.2907771      10.216244
## 7SK              52.504353        2.9614871      0.0000000       2.786248
## A1BG              0.000000        0.2692261      0.0000000       0.000000
## A1BG-AS1          0.000000        0.2692261      0.2907771       0.000000
## A1CF              0.000000        0.0000000      0.0000000       0.000000
##           caseG.lung.NYC caseH.lung.NYC caseI.lung.NYC caseJ.lung.NYC
## 5S_rRNA        7.0430417      3.5559007     222.868159       8.437777
## 5_8S_rRNA      3.3143726      0.0000000      81.613692       7.500246
## 7SK            0.8285931      1.6163185       0.000000       2.343827
## A1BG           0.0000000      0.0000000       0.000000       0.000000
## A1BG-AS1       0.0000000      0.9697911       0.000000       0.000000
## A1CF           0.0000000      0.3232637       9.416964       0.000000
##           case3.liver1 case10.liver1 case12.liver1
## 5S_rRNA     117.519303    156.222134    4907.10184
## 5_8S_rRNA    42.476857     32.888870     687.61937
## 7SK           1.415895      8.222218       0.00000
## A1BG          8.495371      0.000000      31.25543
## A1BG-AS1      4.247686      0.000000       0.00000
## A1CF         89.201399     73.999958       0.00000
dim(tissue_data.df)
## [1] 59090    88
# Read in some additional patient data
patient_data.df <- read_excel("./data/2020.07.30.20165241-supp_tables.xlsx", sheet=2)

# Take a quick look at it
head(patient_data.df)
## # A tibble: 6 x 18
##   `Case\r\nNo` `Viral\r\nhigh\r\nvs.\r\nviral\r\nlow*` Viral\r\nload\r\n(%\r\n~1
##   <chr>        <chr>                                   <chr>                    
## 1 1            High                                    81.2                     
## 2 2            Low                                     0.5                      
## 3 3            Low                                     2                        
## 4 4            Low                                     <0.01                    
## 5 5            High                                    18.5                     
## 6 6            Low                                     0.02                     
## # i abbreviated name:
## #   1: `Viral\r\nload\r\n(%\r\npositive\r\ntissue\r\narea by\r\nRNA\r\nISH)`
## # i 15 more variables: `ISH\r\nhyaline\r\nmembrane\r\nreactivity` <chr>,
## #   `RNA\r\nseq\r\ncoverage` <chr>, `qRT\r\nPCR**` <chr>,
## #   `Keratin\r\ncells /\r\nmm2` <chr>, `Napsin\r\nA cells\r\n/ mm2` <chr>,
## #   `CD1 63\r\ncells/\r\nmm2` <chr>, `CD 3\r\nCells/\r\nmm2` <chr>,
## #   `CD 4\r\ncells/\r\nmm2` <chr>, `CD8\r\ncells/\r\nmm2` <chr>, ...
dim(patient_data.df)
## [1] 24 18

4.0.1 Why can’t we use PCA on this dataset?

Before we attempt one of our non-linear projection methods, let’s try to use some of the methods we’ve spent all this time looking at. We’ll naively repeat our PCA steps on this tissue data and see if we can pull any information from it. We know there are 10 tissue groups in our data, so we’ll use that as a basis for clustering.

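The exact code for this attempt isn’t shown here, but as a rough sketch (assuming a numeric samples-by-genes matrix like the tissue_RNAseq.mx we build later in this lecture), a naive PCA-plus-k-means pass might look like:

```r
# A sketch only - not the exact code behind the plot below.
# Assumes a numeric matrix with samples as rows (e.g. tissue_RNAseq.mx)
set.seed(2025)

# Run PCA on the samples (scale. = TRUE assumes no zero-variance genes)
tissue_pca <- prcomp(tissue_RNAseq.mx, scale. = TRUE)

# Cluster the samples in PC space; we expect 10 tissue groups
tissue_km <- kmeans(tissue_pca$x[, 1:2], centers = 10)

# Plot PC1 vs PC2, coloured by cluster assignment
plot(tissue_pca$x[, 1:2], col = tissue_km$cluster, pch = 19)
```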

As we can see from the above plot, a majority of our samples from various tissue types fall into a single cluster. What’s more, as we’ll see, there is some structure in these samples because multiple samples come from the same patient. Sometimes even the same tissue! These relationships may have a large effect on the overall variation we see in the data, making it hard to distinguish even between tissue types. This is why we’ll turn to non-linear projection and see if it can help us out. Before that we’ll need to do some more wrangling…

4.0.2 Reformat our patient meta data

First let’s examine our patient_data.df, which has 24 patient samples (aka cases) listed in total. These 24 cases relate back to tissue_data.df, which has variables representing different combinations of case and tissue type and observations for some 59K transcripts.

At this point we want to reformat the column names in patient_data.df a bit before selecting just the viral load and viral load percentage information located in the 2nd and 3rd columns. We’ll hold onto this in patient_viral_load.df for later use.

# Create a dataframe holding just the viral_load information for each patient
patient_viral_load.df <-
  patient_data.df %>% 

  # Retain just the first 3 columns
  dplyr::select(1:3) %>% 

  # Rename the 2nd and 3rd columns
  dplyr::rename(case_num = 1, viral_load = 2, viral_load_percent = 3)
head(patient_viral_load.df)

4.0.3 Prepare our RNAseq tissue data for analysis by eliminating features

We now want to format tissue_data.df much in the way we did with our PHU data. Our current data lists genes as observations (rows) and tissue samples as columns. Essentially, we’d like to transpose this. We could do that with t(), but transposition converts everything to a matrix. In the end, we want to work with a data frame so we can hold more information.

To reduce memory use and runtime, however, we should trim our dataset. We aren’t really interested in looking at 59,090 transcripts - many of which may be barely expressed. Since these are normalized counts across the dataset, we can filter out low-expression genes to make tissue_data_filtered.df. Yes, this would again be considered a form of feature elimination. In general we will:

  1. Convert the row names to their own column.
  2. Calculate the mean of the normalized counts for each observation (gene) and filter for a minimum of 0.5.
  3. Calculate the variance for each observation as a backup plan.
system.time(
# Trim the tissue data down
tissue_data_filtered.df <-

  tissue_data.df %>% 

  # Convert the row names to an actual column
  rownames_to_column(var="gene") %>% 

  # Set up the table to perform row-wise operations
  rowwise() %>% 

  # Calculate the mean expression of each gene across all tissue samples
  mutate(mean = ...(c_across(where(is.numeric)))) %>% 

  # Filter for samples with low expression
  filter(mean > 0.5) %>% 

  # Calculate overall variance in case we need to make our dataset smaller
  mutate(variance = ...(c_across(where(is.numeric)))) %>% 

  # Arrange samples by descending variance
  arrange(desc(variance)) %>% 

  # Remove the grouping specification
  ungroup()

)
## Error in `mutate()`:
## i In argument: `mean = ...(c_across(where(is.numeric)))`.
## i In row 1.
## Caused by error in `...()`:
## ! could not find function "..."
## Timing stopped at: 0.05 0 0.06
# Take a look at the final results
head(tissue_data_filtered.df)
## Error in head(tissue_data_filtered.df): object 'tissue_data_filtered.df' not found
# how big is our filtered data frame?
dim(tissue_data_filtered.df)
## Error in eval(expr, envir, enclos): object 'tissue_data_filtered.df' not found
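For reference, one possible way to fill in the blanks above is with mean() and var() - a sketch, assuming tissue_data.df is loaded and the tidyverse is attached:

```r
# One possible completion of the skeleton above
tissue_data_filtered.df <-
  tissue_data.df %>% 
  # Convert the row names to an actual column
  rownames_to_column(var = "gene") %>% 
  # Set up the table to perform row-wise operations
  rowwise() %>% 
  # Calculate the mean expression of each gene across all tissue samples
  mutate(mean = mean(c_across(where(is.numeric)))) %>% 
  # Filter out genes with low expression
  filter(mean > 0.5) %>% 
  # Calculate overall variance in case we need to make our dataset smaller
  mutate(variance = var(c_across(where(is.numeric)))) %>% 
  # Arrange genes by descending variance
  arrange(desc(variance)) %>% 
  # Remove the grouping specification
  ungroup()
```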

4.0.4 More data wrangling to transpose our RNAseq data

Now that we’ve filtered our data down to ~29k genes, we’ll run through three more steps:

  1. We’ll transpose our data using a combination of pivot_longer() and pivot_wider().
  2. We’ll use a series of string matches and joins to add some sample information to our data. It won’t be used for our analysis but will be used when we plot the results!
  3. Lastly, we’ll convert the relevant parts of our data frame to a matrix. There is a huge memory saving in the algorithm when working with a matrix, and you have limited memory on your RStudio Hubs!
# You need to transpose the data. 
# We can do it with dplyr to keep it as a data frame and to add some info

tissue_RNAseq.df <-

  tissue_data_filtered.df %>% 
  
  # trim down the columns to drop mean/variance
  dplyr::select(1:89) %>% 

  # pivot longer
  pivot_longer(cols=c(2:89), names_to = ..., values_to = ...) %>% 

  # redistribute the gene names to columns
  pivot_wider(names_from = ..., values_from = ...)
## Error in tissue_data_filtered.df %>% dplyr::select(1:89) %>% pivot_longer(cols = c(2:89), : '...' used in an incorrect context
# Take a quick look at the transposed result
head(tissue_RNAseq.df)
## Error in head(tissue_RNAseq.df): object 'tissue_RNAseq.df' not found
# We want to add some additional sample information before assessing the data

tissue_RNAseq.df <-

  tissue_RNAseq.df %>% 

  # Grab just the sample names
  dplyr::select(sample) %>% 

  # Grab information from it like case number, tissue, and tissue number
  # takes the form of "caseX.tissueY" or "caseX.tissue.NYC" or "NegControlX"
  # Remember that this returns a LIST of character matrices
  str_match_all(., pattern=r"(case([\w]+)\.([a-z]+)([\d|\.NYC]*)|(NegControl\d))") %>% 

  # Bind all the matrices from all the list elements together in a single object (likely a matrix)
  do.call(rbind, .) %>% 

  # Convert the results to a data frame
  as.data.frame() %>% 

  # Rename the columns based on the capture groups
  dplyr::rename(., sample = V1, case_num = V2, tissue = V3, tissue_num = V4, neg_num = V5) %>%

  # Coalesce some of the info due to negative control samples and clean up a column
  mutate(case_num = coalesce(case_num, neg_num),
         tissue_num = str_replace_all(tissue_num, pattern = "\\.", replacement = "")) %>%
  
  # Drop the neg_num column
  dplyr::select(1:4) %>%
  
  # Join this result to the RNAseq info
  full_join(., y=tissue_RNAseq.df, by=c("sample" = "sample")) %>%
  
  # Join that result to grab viral load information
  right_join(patient_viral_load.df, y=., by=c("case_num" = "case_num"))
## Error in right_join(patient_viral_load.df, y = ., by = c(case_num = "case_num")): object 'patient_viral_load.df' not found
# Look at the resulting dataframe
head(tissue_RNAseq.df)
## Error in head(tissue_RNAseq.df): object 'tissue_RNAseq.df' not found
# How many tissue types do we have?
table(tissue_RNAseq.df$tissue)
## Error in table(tissue_RNAseq.df$tissue): object 'tissue_RNAseq.df' not found
# Generate a matrix version of our data but drop the sample metadata!
tissue_RNAseq.mx <- as.matrix(tissue_RNAseq.df[, ...])
## Error in as.matrix(tissue_RNAseq.df[, ...]): object 'tissue_RNAseq.df' not found
# head(tissue_RNAseq.mx)
str(tissue_RNAseq.mx)
## Error in str(tissue_RNAseq.mx): object 'tissue_RNAseq.mx' not found
dim(tissue_RNAseq.mx)
## Error in eval(expr, envir, enclos): object 'tissue_RNAseq.mx' not found
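One possible completion of the pivoting blanks above (the column names "sample" and "count" are my choices here, not fixed by the data):

```r
# Possible completion of the transpose step above
tissue_RNAseq.df <-
  tissue_data_filtered.df %>%
  # Drop the mean/variance helper columns
  dplyr::select(1:89) %>%
  # Gather every sample column into sample/count pairs
  pivot_longer(cols = c(2:89), names_to = "sample", values_to = "count") %>%
  # Spread the genes back out as columns, one row per sample
  pivot_wider(names_from = gene, values_from = count)

# After the joins above, the first 6 columns hold sample metadata,
# so the matrix version can drop them:
tissue_RNAseq.mx <- as.matrix(tissue_RNAseq.df[, -c(1:6)])
```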

4.1.0 t-Distributed Stochastic Neighbour Embedding with the Rtsne package

We now have a somewhat more complex dataset. We are still short on actual samples (now observations), but 88 observations and nearly 30K features isn’t so bad. A broad question we may wish to ask with such a dataset is whether there is an underlying structure to these samples - i.e., do we see grouping based on tissue type, or perhaps even sample preparation?

t-Distributed Stochastic Neighbour Embedding, or t-SNE, is a way for us to project our high-dimension data onto a lower dimension with the aim of preserving local similarities rather than global disparity. When looking at data points, t-SNE will attempt to preserve the local neighbouring structure. As the algorithm pores over the data, it can use different transformations for different regions as it attempts to transform everything to a lower dimension. It is considered “incredibly flexible” at finding local structure where other algorithms may not.

This flexibility is accomplished through three steps:

  1. Reduce the dimensionality of your data features with PCA!
  2. Generate a probability distribution between all pairs of points, making similar objects highly probable and assigning dissimilar objects a low probability.
  3. Define a similar probability distribution for the samples in a lower dimension while minimizing the divergence between the two distributions, based on a distance metric between points in the lower dimension.

We’ll discuss more about how this algorithm affects interpretation after seeing the results, but this is considered an exploratory data visualization tool, rather than explanatory.

To produce our t-SNE projection we’ll use the Rtsne() function from the package of the same name. Some important parameters are:

  • X is our data matrix where each row is an observation

  • dims sets the number of dimensions we’d like to project onto (default is 2).

  • initial_dims sets the number of dimensions that should be retained in the initial PCA step (Default 50).

  • perplexity a numeric parameter that tunes between local and global aspects of your data.

    • This parameter is a guess as to how many close neighbours a point may have. If you have a sense of sample types (or clusters!) ahead of time, you could try to play with this value (default is 30).

    • According to the algorithm this value should follow this rule: \(perplexity * 3 \lt nrow(X) -1\)

  • pca_scale is a boolean to set if the initial PCA step should use scaled data.

  • max_iter is the maximum number of iterations in the algorithm (default is 1000).

# Rtsne prefers using a matrix for memory issues
# set a seed for reproducible results
set.seed(2025)

# Default perplexity is 30, but with 88 samples we can only go as high as 29 before Rtsne errors
# We have just 88 samples, but between 1-52 samples per "group"
tissue_tsne <- Rtsne(..., 
                     dims=2, 
                     perplexity=5, 
                     verbose=TRUE, 
                     pca_scale = TRUE,
                     max_iter = 1500) 
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
# What does our t-SNE object look like?
str(...)
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
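As a hedged sketch, the blanks above can be filled with our sample-by-gene matrix:

```r
# Possible completion: X is the matrix of samples (rows) by genes (columns)
set.seed(2025)
tissue_tsne <- Rtsne(X = tissue_RNAseq.mx, 
                     dims = 2, 
                     perplexity = 5, 
                     verbose = TRUE, 
                     pca_scale = TRUE,
                     max_iter = 1500)

# Inspect the resulting t-SNE object
str(tissue_tsne)
```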

4.1.1 Extract information from our tsne object

Looking above at the result of our t-SNE analysis, we can notice a few things…

  1. We get back the number of objects we put in: 88.
  2. The element Y is an 88x2 matrix holding the coordinates for our new projection.
  3. There are no eigenvectors or other information about the variables or dimensions or how they were used in the projection!

That’s right, there is no underlying information for mapping back to our original dataset. It’s a completely black box with no way to reverse engineer the process. That’s because the process itself is stochastic! Whereas PCA was a deterministic process - repeatable the same way every time with the same data - that is not the case for t-SNE. That’s why even though it can be quite powerful in identifying local similarities, t-SNE does not provide a mathematical pathway back to our original data!

Let’s extract the information, combine it with our sample information and project it using ggplot2. We can do this with a scatterplot since we have a set of x/y coordinates we can work with.

# Build our new data frame from the Y values
tissue_tsne.df <- data.frame(x.coord = ...,
                             y.coord = ...)
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
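The projected coordinates live in the Y element of the Rtsne result, so one possible completion is:

```r
# Possible completion: pull the two projection coordinates from Y
tissue_tsne.df <- data.frame(x.coord = tissue_tsne$Y[, 1],
                             y.coord = tissue_tsne$Y[, 2])
```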
# Add in our sample information
tissue_tsne.df <- cbind(tissue_RNAseq.df[,1:6], tissue_tsne.df)
## Error in cbind(tissue_RNAseq.df[, 1:6], tissue_tsne.df): object 'tissue_RNAseq.df' not found
# Fix up the information just a little bit to remove NA viral load information
tissue_tsne.df <-
  tissue_tsne.df %>% 
  # replace NAs with DNW (did not work)
  mutate(viral_load = replace_na(viral_load, replace = "DNW"))
## Error in mutate(., viral_load = replace_na(viral_load, replace = "DNW")): object 'tissue_tsne.df' not found
head(tissue_tsne.df)
## Error in head(tissue_tsne.df): object 'tissue_tsne.df' not found
combo.colours = c(brewer.pal(12, "Paired"), brewer.pal(12, "Set3"), brewer.pal(8, "Set1"))

# 1. Data
ggplot(data = tissue_tsne.df) +
  # 2. Aesthetics
  aes(x = x.coord, 
      y = y.coord, 
      colour = ...) +

  # Themes
  theme_bw() +
  theme(text = element_text(size=20)) +

  # 3. Scaling
  scale_colour_manual(values = combo.colours) +

  # 4. Geoms
  geom_text(aes(label = case_num), size = 10)
## Error in ggplot(data = tissue_tsne.df): object 'tissue_tsne.df' not found

4.1.2 Interpreting our t-SNE plot

While we don’t have a lot of samples, you can still see that we were able to cluster some of our data by tissue type without providing that classification to the algorithm! Great job team!

We can see that we get close clustering of tissues like placenta, heart, and bowel. Our liver samples are kind of everywhere but perhaps using a different perplexity would provide different results.

One interesting thing we can see is that regardless of tissue type, some samples cluster together based on case number - namely, case numbers 1, 4 and 5 seem to have some strong underlying expression profiles that connect them across tissue samples. We may also be seeing false relationships, so beware!


4.2.0 Uniform Manifold Approximation and Projection

This algorithm for projection is in the same flavour as t-SNE but has some differences, including:

  1. Increased speed and better preservation of the global structure in your data
  2. A different theoretical foundation used to balance between local and global structure

What does that mean for us? Faster results, and more interpretable results! Otherwise the same caveats apply. The setup is slightly easier, with few options to change if you keep the defaults. We can access umap() from the package of the same name. You may also alter the default settings by creating a umap.defaults object. More information on that here

For more tinkering, you can choose to use the uwot package instead where the umap() function has more options that are easily modified.
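As a brief sketch of that customization with the umap package (the parameter values here are illustrative, not recommendations):

```r
# Sketch: copy the default configuration and tweak it before projecting
custom.config <- umap.defaults
custom.config$n_neighbors <- 10    # illustrative value
custom.config$random_state <- 2025 # seed for reproducibility

# Pass the modified configuration to umap()
tissue_umap_custom <- umap(tissue_RNAseq.mx, config = custom.config)
```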

# Set our seed
set.seed(2025)

# Generate our projection
tissue_umap <- ...(tissue_RNAseq.mx)
## Error in ...(tissue_RNAseq.mx): could not find function "..."
# What does the UMAP object look like?
str(tissue_umap)
## Error in str(tissue_umap): object 'tissue_umap' not found

4.2.1 Extract information from our UMAP object

Looking at our UMAP object tissue_umap we see it houses the projection parameters used but also some additional variables:

  1. data: holds our original data matrix.
  2. layout: contains the projection coordinates we need for plotting the data.
  3. knn: a weighted k-nearest neighbour graph. This is a graph that connects each observation to its nearest k neighbours. This generates the first topological representation of the data - like an initial sketch.
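To build intuition for that first step, here is a toy base-R sketch of a k-nearest-neighbour lookup (this is not UMAP's internal implementation, just an illustration of the concept on made-up data):

```r
# For each observation, find the indices of its k closest neighbours
set.seed(2025)
m <- matrix(rnorm(20), nrow = 5)   # 5 toy observations, 4 variables
d <- as.matrix(dist(m))            # pairwise Euclidean distances

k <- 2
knn.idx <- t(apply(d, 1, function(row) {
  order(row)[2:(k + 1)]            # skip position 1: the point itself
}))
knn.idx                            # row i holds the 2 nearest neighbours of observation i
```

UMAP then weights these neighbour edges to form its fuzzy graph; the sketch only finds the neighbours.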

You may notice again that there is no data suggesting how we arrived at this solution. There are no eigenvectors or eigenvalues to reverse the projection!

Let’s extract the layout information, combine it with our sample information, and plot it using ggplot2.

# Re-map our projection points with our tissue data

tissue_umap.df <- data.frame(x.coord = tissue_umap$layout[,1],
                             y.coord = tissue_umap$layout[,2])

tissue_umap.df <- cbind(tissue_RNAseq.df[,1:6], tissue_umap.df)

tissue_umap.df <-
  tissue_umap.df %>% 
  # replace NAs with DNW (did not work)
  mutate(viral_load = replace_na(viral_load, replace = "DNW"))

head(tissue_umap.df)

combo.colours <- c(brewer.pal(12, "Paired"), brewer.pal(12, "Set3"), brewer.pal(8, "Set1"))

# 1. Data
ggplot(data = tissue_umap.df) +
  # 2. Aesthetics
  aes(x = x.coord, 
      y = y.coord, 
      colour = tissue) +

  # Themes
  theme_bw() +
  theme(text = element_text(size=20)) +

  # 3. Scaling
  scale_colour_manual(values = combo.colours) +

  # 4. Geoms
  geom_text(aes(label = case_num), size = 10)

4.2.2 Interpreting our UMAP result

So it looks like, without much tinkering, we retrieved a fairly nice result. We can see now that liver and kidney are more closely grouped as tissues, while heart samples still generally cluster together. Bowel and jejunum appear spatially grouped, and our placenta samples remain close to each other. The clustering we saw with samples 1, 4, and 5 appears less pronounced.

There does appear to be some structure among the lung samples across case numbers, so this might be an avenue to explore next to see whether there truly is a relationship between these groups.


5.0.0 Class summary

While t-SNE and UMAP produce projections onto a 2-dimensional plane, they do not strictly cluster the data. No labels are produced, and you have no route back to understanding the relationships between closely plotted points. PCA, on the other hand, is strictly a dimension reduction tool. It does not place or assign datapoints to any groups BUT it is useful on large datasets prior to clustering!

Today we took a deep dive into principal component analysis. There are of course different variants of this based on the assumptions you can make about your observations and variables, like independent component analysis (ICA, non-Gaussian features) and multiple correspondence analysis (MCA, categorical features). Some additional methods also store the transformation the way PCA does, notably principal coordinate analysis (PCoA) and variational autoencoders (VAE).

Overall we should remember that while PCA's feature extraction can run into problems, it is deterministic and repeatable. The final results are also provided in such a way that new observations can be transformed and projected onto the same principal components. You can also feed these components back into clustering algorithms like k-means to try and identify specific subgroups.
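As a minimal sketch of that PCA-then-cluster workflow, using base R and the built-in iris measurements as stand-in data (our tissue matrix would slot in the same way):

```r
# Reduce dimensions with PCA, then cluster the component scores with k-means
pca <- prcomp(iris[, 1:4], scale. = TRUE)

set.seed(2025)
km <- kmeans(pca$x[, 1:2], centers = 3, nstart = 25)  # cluster on the first 2 PCs

# Compare the unsupervised clusters against the known species labels
table(cluster = km$cluster, species = iris$Species)
```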

t-SNE and UMAP, on the other hand, appear to do a much better job with high-dimensional data. They preserve local structure, and UMAP can also do a fairly good job of preserving global structure. These tools make for great exploratory analysis of complex datasets. Interpretation of the relationships, however, is not mathematically clear like it is in PCA. These are, after all, projections from a higher dimension for our simpler primate brains!


5.1.0 Weekly assignment

This week’s assignment will be found under the current lecture folder under the “assignment” subfolder. It will include an R markdown notebook that you will use to produce the code and answers for this week’s assignment. Please provide answers in markdown or code cells that immediately follow each question section.

Assignment breakdown

Code (50%)
- Does it follow best practices?
- Does it make good use of available packages?
- Was the data prepared properly?

Answers and Output (50%)
- Is the output based on the correct dataset?
- Are the groupings appropriate?
- Are the titles/axes/legends correct?
- Is the interpretation of the graphs correct?

Since coding styles and solutions can differ, students are encouraged to use best practices. Marks may be awarded for well-coded or elegant solutions.

You can save and download the markdown notebook in its native format. Submit this file to the appropriate assignment section by 12:59 pm on the date of our next class: April 17th, 2025.


5.2.0 Acknowledgements

Revision 1.0.0: created and prepared for CSB1021H S LEC0141, 03-2021 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.

Revision 1.0.1: edited and prepared for CSB1020H S LEC0141, 03-2022 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.

Revision 1.0.2: edited and prepared for CSB1020H S LEC0141, 03-2023 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.

Revision 2.0.0: Revised and prepared for CSB1020H S LEC0141, 03-2024 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.

Revision 2.0.1: Revised and prepared for CSB1020H S LEC0141, 03-2025 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.


The Centre for the Analysis of Genome Evolution and Function (CAGEF)

The Centre for the Analysis of Genome Evolution and Function (CAGEF) at the University of Toronto offers comprehensive experimental design, research, and analysis services in microbiome and metagenomic studies, genomics, proteomics, and bioinformatics.

From targeted DNA amplicon sequencing to transcriptomes, whole genomes, and metagenomes, from protein identification to post-translational modification, CAGEF has the tools and knowledge to support your research. Our state-of-the-art facility and experienced research staff provide a broad range of services, including both standard analyses and techniques developed by our team. In particular, we have special expertise in microbial, plant, and environmental systems.

For more information about us and the services we offer, please visit https://www.cagef.utoronto.ca/.